The current trend toward multicore architectures on shared memory systems
underscores the need for parallelism. While there are several programming models
for expressing parallelism, thread programming models such as OpenMP and POSIX
threads have become the standard for these systems. MPI (Message Passing
Interface), which remains the dominant model in high-performance computing
today, faces this challenge.

The previous version of MPI, MPI-1, has no shared memory concept, and the
current version, MPI-2, has only limited support for shared memory systems. In
this research, MPI-2 is compared with OpenMP to see how well MPI performs on
multicore / SMP (Symmetric Multiprocessor) machines.

A comparison between OpenMP, for the thread programming model, and MPI, for the
message passing programming model, is conducted on multicore shared memory
machine architectures to see which performs better in terms of speed and
throughput. The application used to assess the scalability of the evaluated
parallel programming solutions is matrix multiplication with a customizable
matrix dimension.

Much research has been done on large-scale parallel computing using large-scale
benchmarks such as the NAS Parallel Benchmarks (NPB) to standardize testing [1].
This research is conducted on small-scale parallel computing and places more
emphasis on the performance evaluation of the MPI and OpenMP parallel
programming models using a self-created benchmark.

Bibliography (2007-2010)






TABLE OF CONTENTS



ABSTRACT

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER I : INTRODUCTION

1.1 Background

1.2 Related Research

1.3 Research Significance

1.4 Research Problem

1.5 Research Purposes

1.6 Scope of the Research

CHAPTER II : FUNDAMENTAL THEORY

2.1 Shared Memory Systems

2.2 Distributed Memory

2.3 Distributed Shared Memory

2.4 Parallel Programming Models

2.4.1 OpenMP

2.4.2 MPI

CHAPTER III : WORKSHARE METHODOLOGY

3.1 Matrix Multiplication Workshare Algorithm

3.1.1 MPI Workshare Algorithm

3.1.2 OpenMP Workshare Algorithm

CHAPTER IV : RESULT AND DISCUSSION

4.1 Performance Test

4.2 Statistical Analysis Result

4.2.1 Parallel Runtime Towards Matrix Dimension

4.2.2 Parallel Throughput Towards Matrix Dimension

4.2.3 Parallel Speedup Towards Matrix Dimension

CHAPTER V : CONCLUDING REMARKS

5.1 Conclusion

5.2 Future Work

BIBLIOGRAPHY




LIST OF FIGURES 



2.1 Block diagram of a generic, cache-based dual core processor

2.2 An illustration of a distributed memory system of three computers

2.3 An illustration of a distributed shared memory system 

2.4 The fork-join programming model supported by OpenMP 

3.1 Matrix Multiplication Structure 

3.2 Matrix A Row Based Division 

3.3 Matrix B Distribution 

3.4 Matrix Multiplication For Each Worker Process 

3.5 OpenMP Matrix Multiplication Algorithm Scheme 

4.1 Running time comparison with various matrix dimension 

4.2 Throughput comparison with various matrix dimension 

4.3 Speed comparison with various matrix dimension 






LIST OF TABLES 



Running time comparison with various matrix dimension 
Throughput comparison with various matrix dimension 
Speed comparison with various matrix dimension






CHAPTER I 
INTRODUCTION 



1.1 Background 

The growth of multicore processors has increased the need for parallel programs
on systems from the largest to the smallest (clusters to laptops). There are
many ways to express parallelism in a program. In HPC (High Performance
Computing), MPI (Message Passing Interface) has been the main tool that most
programmers use for the message passing parallel programming model [9].

A multi-core processor looks the same as a multi-socket single-core server to
the operating system (i.e., before multi-core, dual-socket servers provided two
processors, like today's dual-core processors). Programming in this environment
is essentially a matter of using POSIX threads.

Thread programming can be difficult and error prone. OpenMP was developed to
give programmers a higher level of abstraction and make thread programming
easier. In line with the growth of the multicore trend, parallel programming
using OpenMP is gaining popularity among HPC developers.

Together with the growth of the thread programming model on shared memory
machines, MPI, which has been intended for parallel distributed systems since
MPI-1, has also been improved to support shared memory systems. The principal
MPI-1 model has no shared memory concept, and MPI-2 has only a limited
distributed shared memory concept. Nonetheless, MPI programs are regularly run
on shared memory computers.

MPI performance on shared memory systems will be tested on a cluster of shared
memory machines. OpenMP will be used as a reference on the same multicore
systems as the MPI cluster (both MPI and OpenMP will have an equal number of
worker cores).

The application used for testing is an N x N square matrix multiplication with
an adjustable matrix dimension N ranging from 10 to 2000. For OpenMP, a single
multicore machine with two worker cores is used to compute the matrix product.
For MPI, two multicore machines with three cores are used: one core runs the
master process, which decomposes the matrix into sub-matrices, distributes them
to the workers, and composes the final result matrix from the sub-matrix
multiplications, and the other two cores run the worker processes.

The parameters obtained from this test are the number of floating point
operations per second (FLOPS), which in this case come from the matrix
multiplication, the program running time, and the speedup. For MPI, the two
machines used for testing have quite similar performance (memory and CPU). MPI
testing is done over a wired LAN connection to achieve the best run time and to
minimize the communication time between processes.

1.2 Related Research 

There is already related research in this area. One study ran its tests using a
parallel benchmark, the NAS Parallel Benchmarks (NPB), to standardize the
evaluation on large clusters with tens to hundreds of CPU cores [3], which can
produce large speedup and throughput.

This research is simpler than the previous research, which uses tens or
hundreds of machines and a complicated benchmark. This research focuses on how
different parallel programming models affect program performance. Testing in
this research is done using a self-created benchmark that measures the program
running time and the number of matrix multiplication operations per second.

1.3 Research Significance 

This research describes how work sharing is done in different parallel
programming models. It gives comparative results between the message passing
and shared memory programming models in terms of runtime and throughput. The
testing methodology is also simple and highly usable on the available resources.

1.4 Research Problem 

The problems covered in this research are:

1. How do different parallel programming models influence parallel performance
on different memory architectures?

2. How do workshare constructs differ between shared and distributed shared
memory systems?

1.5 Research Purposes 

The objectives of this research are:

1. Evaluating parallel performance between the thread and message passing
programming models.

2. Evaluating the parallel workshare algorithms of the thread and message
passing programming models.

1.6 Scope of the Research 

The experiments are conducted on a single multicore shared memory machine with
two cores for the thread programming model (OpenMP), and on two identical
multicore shared memory machines with two cores each for the message passing
programming model, without specifying a thread safety level. The matrix
multiplication program used for testing is also limited to a maximum dimension
of 2000 because of the machines' limited computing power. The test parameters
generated are the number of floating point operations per second (FLOPS), which
in this case come from the matrix multiplication, the program running time, and
the speedup.



CHAPTER II 
FUNDAMENTAL THEORY 



Parallel computing is a form of computation in which many calculations are carried 
out simultaneously, operating on the principle that large problems can often be 
divided into smaller ones, which are then solved concurrently ("in parallel"). Parallel
computing is carried out by a number of parallel computers. Each parallel
computer may have a different CPU core and memory architecture.

Parallel computers can be roughly classified according to the level at which
the hardware supports parallelism. Currently there are three types: shared
memory (usually a multicore processor), distributed memory (clusters, MPPs, and
grids), and distributed shared memory (clusters of shared memory systems).

2.1 Shared Memory Systems 

In computer hardware, shared memory refers to a (typically) large block of 
random access memory that can be accessed by several different central processing 
units (CPUs) in a multiple-processor computer system. 

In a shared-memory parallel computer, the individual processors share memory
(and I/O) in such a way that each of them can access any memory location with
the same speed; that is, they have a uniform memory access (UMA) time.

Each individual processor in a shared memory system has a small, fast private
cache memory. Cache memory is used to supply each processor core with data and
instructions at high rates, because fetching data directly from main memory is
slower than fetching it from cache memory.

The issue with shared memory systems is that many CPUs need fast access 
to memory and will likely cache memory [H], which has two complications: 

• CPU-to-memory connection becomes a bottleneck. Shared memory computers 
cannot scale very well. Most of them have ten or fewer processors. 

• Cache coherence: Whenever one cache is updated with information that may 
be used by other processors, the change needs to be reflected to the other 
processors, otherwise the different processors will be working with incoherent 
data. Such coherence protocols can, when they work well, provide extremely 
high-performance access to shared information between multiple processors. 
On the other hand they can sometimes become overloaded and become a 
bottleneck to performance. 

However, to avoid the memory inconsistency mentioned above, there is a cache
memory that can be shared by all processors. The shared cache can be used by
each processor core to write and read data. Figure 2.1 illustrates the cache
memory.

2.2 Distributed Memory 

In computer science, distributed memory refers to a multiple-processor computer
system in which each processor has its own private memory. In other words, each
processor resides on a different machine. Computational tasks can only operate
on local data, and if remote data is required, the computational task must
communicate with one or more remote processors.

Figure 2.1: Block diagram of a generic, cache-based dual core processor

In a distributed memory system there is typically a processor, a memory, and
some form of interconnection that allows programs on each processor to interact
with each other. The interconnect can be organised with point-to-point links,
or separate hardware can provide a switching network. The network topology is a
key factor in determining how the multi-processor machine scales. The links
between nodes can be implemented using a standard network protocol (for example
Ethernet). Figure 2.2 shows the architecture of a distributed memory system.

In contrast, a shared memory multiprocessor offers a single memory space used
by all processors. Processors do not have to be aware of where data resides,
except that there may be performance penalties, and that race conditions are to
be avoided.

Figure 2.2: An illustration of a distributed memory system of three computers

2.3 Distributed Shared Memory 

Distributed Shared Memory (DSM), also known as a distributed global ad- 
dress space (DGAS), is a term in computer science that refers to a wide class of 
software and hardware implementations, in which each node of a cluster has access 
to shared memory in addition to each node's non-shared private memory. 

The shared memory component is usually a cache coherent SMP machine. 
Processors on a given SMP can address that machine's memory as global. The 
distributed memory component is the networking of multiple SMPs. SMPs know 
only about their own memory - not the memory on another SMP. Therefore, network 
communications are required to move data from one SMP to another. Figure 2.3
illustrates a distributed shared memory system.

Figure 2.3: An illustration of a distributed shared memory system

2.4 Parallel Programming Models 

All of these classes of parallel hardware need programming languages that are
able to share or divide work among processors. Concurrent programming
languages, libraries, APIs, and parallel programming models have been created
for programming parallel computers. These can generally be divided into classes
based on the assumptions they make about the underlying memory architecture:
shared memory, distributed memory, or distributed shared memory.

A parallel programming model is a set of software technologies to express 
parallel algorithms and match applications with the underlying parallel systems. 
It encloses the areas of applications, programming languages, compilers, libraries, 
communications systems, and parallel I/O. 

Parallel models are implemented in several ways: as libraries invoked from
traditional sequential languages, as language extensions, or as completely new
execution models. They are also roughly categorized for two kinds of systems:
shared-memory systems and distributed-memory systems, though the lines between
them are largely blurred nowadays.

Shared memory programming languages communicate by manipulating shared memory
variables through threads. Threads are used as subtasks that carry instruction
streams to one or more processor cores; however, one thread can carry only one
instruction stream at a time. In other words, multiple threads can carry
multiple instruction streams. Distributed memory uses message passing. POSIX
Threads and OpenMP are two of the most widely used shared memory APIs, whereas
Message Passing Interface (MPI) is the most widely used message-passing system
API.

Shared memory systems use threads, whereas in distributed memory systems tasks
and communication are carried out by message passing over the network.

A programming model is usually judged by its expressibility and simplicity, 
which are by all means conflicting factors. The ultimate goal is to improve produc- 
tivity of programming. 

2.4.1 OpenMP 

OpenMP (Open Multi-Processing) is an application programming interface 
(API) that supports multi-platform shared memory multiprocessing programming 
in C, C++ and Fortran on many architectures, including Unix and Microsoft Win- 
dows platforms. It consists of a set of compiler directives, library routines, and 
environment variables that influence run-time behavior. 

OpenMP is an implementation of multithreading, a method of parallelization
whereby the master "thread" (a series of instructions executed consecutively)
"forks" a specified number of slave "threads" and a task is divided among them.
The threads then run concurrently, with the runtime environment allocating
threads to different processors. Hence, OpenMP is the thread-based parallel
programming model used in this research. Figure 2.4 illustrates the fork-join
behavior of multithreading.

Figure 2.4: The fork-join programming model supported by OpenMP



OpenMP uses pragma directives to express parallelism in a code block. Parts of
the program that are not enclosed by a parallel construct are executed
serially. When a thread encounters a parallel construct, a team of threads is
created to execute the associated parallel region, which is the code
dynamically contained within the parallel construct. Although this construct
ensures that computations are performed in parallel, it does not distribute the
work of the region among the threads in a team. In fact, if the programmer does
not use the appropriate syntax to specify this action, the work will be
replicated. At the end of a parallel region, there is an implied barrier that
forces all threads to wait until the work inside the region has been completed.
Only the initial thread continues execution after the end of the parallel
region.

The thread that encounters the parallel construct becomes the master of the new
team. Each thread in the team is assigned a unique thread number (also referred
to as the "thread id") to identify it. Thread numbers range from zero (for the
master thread) up to one less than the number of threads within the team, and
they can be accessed by the programmer.
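
The difference between a bare parallel region, where the work is replicated, and a work-sharing loop can be illustrated with a minimal C sketch. The function names `replicated_iterations` and `shared_iterations` are hypothetical and are not taken from the benchmark used in this research; the sketch assumes a compiler with OpenMP enabled (for example the `-fopenmp` flag of GCC), and without it the pragmas are ignored and both functions run serially.

```c
/* Every thread in the team runs the whole loop, so the work is replicated. */
long replicated_iterations(int n)
{
    long total = 0;
    #pragma omp parallel
    {
        for (int i = 0; i < n; i++) {
            /* Protect the shared counter against concurrent updates. */
            #pragma omp atomic
            total++;
        }
    }
    return total; /* n times the number of threads in the team */
}

/* The loop construct splits the n iterations among the threads of the team. */
long shared_iterations(int n)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total++;
    }
    return total; /* n, regardless of the team size */
}
```

With a team of two threads, `replicated_iterations(1000)` would count 2000 loop bodies, while `shared_iterations(1000)` counts 1000.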

2.4.2 MPI 

Message Passing Interface (MPI) is an API specification that allows com- 
puters to communicate with one another. It is used in computer clusters and su- 
percomputers. MPI is a language-independent communications protocol used to 
program parallel computers. Both point-to-point and collective communication are 
supported. MPI "is a message-passing application programmer interface, together
with protocol and semantic specifications for how its features must behave in any
implementation." MPI's goals are high performance, scalability, and portability.

MPI is not sanctioned by any major standards body; nevertheless, it has become
a de facto standard for communication among processes that model a parallel
program running on a distributed memory system. Actual distributed memory
supercomputers such as computer clusters often run such programs. The principal
MPI-1 model has no shared memory concept, and MPI-2 has only a limited
distributed shared memory concept. Nonetheless, MPI programs are regularly run
on shared memory computers.

The MPI interface is meant to provide essential virtual topology,
synchronization, and communication functionality between a set of processes
(that have been mapped to nodes/servers/computer instances) in a
language-independent way, with language-specific syntax (bindings), plus a few
language-specific features. MPI programs always work with processes, but
programmers commonly refer to the processes as processors. Typically, for
maximum performance, each CPU (or core in a multi-core machine) is assigned
just a single process. This assignment happens at runtime through the agent
that starts the MPI program (i.e. the MPI daemon), normally called mpirun or
mpiexec. The machine that initiates the MPI daemon ring hosts the process
manager on one of its CPU cores. The process manager is identified by ID 0, and
all of its workers have IDs greater than 0.

The initial implementation of the MPI 1.x standard was MPICH, from Ar- 
gonne National Laboratory (ANL) and Mississippi State University. ANL has con- 
tinued developing MPICH for over a decade, and now offers MPICH 2, implementing 
the MPI-2.1 standard. 



CHAPTER III 
WORKSHARE METHODOLOGY 



3.1 Matrix Multiplication Workshare Algorithm 



The matrix multiplication structure is defined in Figure 3.1.

Figure 3.1: Matrix Multiplication Structure



Multiplying two $N \times N$ matrices with the sequential algorithm [T] takes,
for each element, $N$ multiplications and $N - 1$ additions. Since there are
$N^2$ elements in the matrix, this yields a total of $N^2 (2N - 1)$
floating-point operations, or about $2N^3$ for large $N$, that is, $O(N^3)$.
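
As a worked instance of this operation count, assume the largest matrix dimension used in this research, $N = 2000$:

```latex
\begin{align*}
N^2 (2N - 1) &= 2000^2 \times (2 \times 2000 - 1) \\
             &= 4\,000\,000 \times 3999 \\
             &= 15\,996\,000\,000 \approx 1.6 \times 10^{10}
\end{align*}
```

This is within 0.03% of the large-$N$ estimate $2N^3 = 1.6 \times 10^{10}$.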

Parallel workshare does not change the arithmetic operations of matrix
multiplication. It only changes the execution sequence across multiple
processors. However, the complexity / operation count per processor changes
because the work is shared among the processor cores.

Algorithm 1 Matrix Multiplication Algorithm
for i = 1 to n do
    for k = 1 to l do
        for j = 1 to m do
            c(i, k) ← c(i, k) + a(i, j) × b(j, k)
        end for
    end for
end for

3.1.1 MPI Workshare Algorithm 

MPI shares its work among processing units by passing messages across the
network. Process identification between CPU cores (both on the same computer
and on different computers) is similar to OpenMP (i.e., ID 0 = master process,
ID > 0 = worker processes).

The parallel programming model using message passing is similar to Unix socket
programming, in which the process manager can send a chunk of a task to each of
its worker processes and receive the computation results from all of them.

Matrix multiplication between two square ($N \times N$) matrices can be shared
between processes using a certain rule. One such rule is to make the master
process the process manager, which divides and distributes the matrix elements
according to the number of CPU core workers. Thus, if 4 processes are used, 1
is used as the process manager and the other 3 as worker processes. Process ID
0 becomes the process manager and process IDs 1 to 3 become workers. There are
several MPI models for distributing a matrix among the worker processes, and
one of them is row based distribution.

In row based distribution, suppose there are two $N \times N$ matrices,
$A_{ij}$ and $B_{ij}$, to be multiplied. All elements of matrix $B_{ij}$ (rows
and columns) are sent by the master process to every worker process. Matrix
$A_{ij}$ is first divided by rows before it is sent. For example, if matrix
$A_{ij}$ has 4 rows and 3 workers are available, each worker gets 4/3, which is
1 row, with 1 residual row left over from the integer division. The residual
rows are handed out one each to the worker processes whose ID is lower than or
equal to the number of residual rows, in loop order (starting from worker ID
1). Thus, the 1 residual row is added to worker process ID 1.

Figure 3.2: Matrix A Row Based Division

Figure 3.2 illustrates row based distribution. Worker process ID 1 works on 2
rows because there is a residual row from the integer division. Worker ID 1 is
the one that receives the extra row because it is the first worker process
encountered in the iteration.

An offset variable is advanced by the number of rows sent to each worker
process. Note that the offset variable has a crucial role in keeping track of
the row index of matrix A, so that each worker process knows which rows it
needs to work on. Figure 3.3 shows how all rows and columns of matrix B are
sent to every worker process.
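
The row division and offset bookkeeping described above can be sketched in C as follows. The helper `split_rows` and its output arrays are hypothetical names introduced only for illustration; the variables `averow`, `extra`, and the `dest` loop follow the master process pseudocode.

```c
/* Split total_rows rows of matrix A among numworkers workers (IDs 1..numworkers).
   rows[w - 1] and offset[w - 1] describe the slice handed to worker w. */
void split_rows(int total_rows, int numworkers, int *rows, int *offset)
{
    int averow = total_rows / numworkers; /* rows every worker receives */
    int extra  = total_rows % numworkers; /* residual rows from the division */
    int next   = 0;                       /* running row index into matrix A */

    for (int dest = 1; dest <= numworkers; dest++) {
        /* The first `extra` workers each take one residual row. */
        rows[dest - 1]   = (dest <= extra) ? averow + 1 : averow;
        offset[dest - 1] = next;
        next += rows[dest - 1];
    }
}
```

For the example of 4 rows and 3 workers, this yields 2, 1, and 1 rows starting at row offsets 0, 2, and 3.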

Matrix $C_{i,j}$ with $N \times N$ dimensions is used to hold the result
elements. The computation of a single row of elements $C_{i,j}$ (for
$j = 1, 2, \ldots, N$) requires the entire matrix $B_{i,j}$ and a subset of the
rows of $A$, $A_{g,j}$ (where $g \subset i$), respectively. Each worker process
holds those required elements. For $p$ processes, each of the $p - 1$ worker
processes can compute the row elements $C_{(i/(p-1)) + extra,\, j}$ (for
$j = 1, 2, \ldots, N$), where $extra$ is the residual value, which may differ
between the $p - 1$ workers. The matrix computation for each worker process is
shown in Figure 3.4.

After the sub-matrix multiplication is done in each worker process, each worker
sends its result matrix and offset value back to the master process. For each
result received from a worker process, the master process places it in the
final result matrix according to the offset value.

The most important point about this division of the matrix operation is that
there is no data dependency between the sub-matrix multiplications of the
processes. Thus, no data synchronization is needed between the processes, which
reduces program complexity.
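
The independence of the row slices can be seen in a minimal C sketch of the computation for one slice. The function `multiply_slice` is a hypothetical illustration that assumes row-major storage in flat arrays and omits all message passing; it is not the code of the benchmark itself.

```c
/* Compute rows [offset, offset + rows) of C = A * B for row-major n x n
   matrices. Each call reads A and B but writes only its own rows of C,
   so different slices can be computed in any order or at the same time. */
void multiply_slice(int n, int offset, int rows,
                    const double *a, const double *b, double *c)
{
    for (int i = offset; i < offset + rows; i++) {
        for (int k = 0; k < n; k++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                sum += a[i * n + j] * b[j * n + k];
            }
            c[i * n + k] = sum;
        }
    }
}
```

Calling it once per worker slice, for example with (offset, rows) pairs (0, 2) and (2, 1) for a 3 x 3 matrix, fills the whole result matrix without any coordination between the calls.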

The MPI matrix multiplication workshare algorithm is divided into two parts:
one for the master process, shown in Algorithm 2, and the other for the worker
processes, shown in Algorithm 3.

Algorithm 2 MPI Master Process Algorithm

{Master Process}
if taskid = 0 then
    {Initializing Matrix A}
    for i = 0 to mtrksA_row - 1 do
        for j = 0 to mtrksA_col - 1 do
            a[i][j] ← RAND()
        end for
    end for
    {Initializing Matrix B}
    for i = 0 to mtrksA_col - 1 do
        for j = 0 to mtrksB_col - 1 do
            b[i][j] ← RAND()
        end for
    end for
    {Start Counting Running Time}
    t_time ← MPI_Wtime()
    {Send matrix data to the worker tasks}
    averow ← mtrksA_row / numworkers
    extra ← mtrksA_row % numworkers
    offset ← 0
    {Set the message tag type to 1, marking messages sent by the master process}
    mtype ← 1
    for dest = 1 to numworkers do
        if dest ≤ extra then
            rows ← averow + 1
        else
            rows ← averow
        end if
        {Master Sending Process}
        MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD)
        MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD)
        count ← rows * mtrksA_col
        MPI_Send(&a[offset][0], count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD)
        count ← mtrksA_col * mtrksB_col
        MPI_Send(&b, count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD)
        offset ← offset + rows
    end for
    {Wait for results from all worker tasks}
    mtype ← 2
    for i = 1 to numworkers do
        source_prcs ← i
        MPI_Recv(&offset, 1, MPI_INT, source_prcs, mtype, MPI_COMM_WORLD, &status)
        MPI_Recv(&rows, 1, MPI_INT, source_prcs, mtype, MPI_COMM_WORLD, &status)
        count ← rows * mtrksB_col
        MPI_Recv(&c[offset][0], count, MPI_DOUBLE, source_prcs, mtype, MPI_COMM_WORLD, &status)
    end for
    {Stop Counting Running Time}
    t_time ← MPI_Wtime() - t_time
    {Counting Amount Of Matrix Multiplication Operations}
    ops ← (2 * pow(NRA, 3)) - pow(NRA, 2)
    {Counting Amount Of Matrix Multiplication Operations Per Second}
    rate ← ops / t_time / 1000000.0
end if

Figure 3.4: Matrix Multiplication For Each Worker Process

Master process algorithm steps are:

1. The matrix values are initialized using a random function.

2. Wall time measurement is started using the MPI_Wtime() function.

3. The number of rows for each worker process is calculated, including any
additional residual row.

4. The sub rows of matrix A are sent to each worker process according to the
row and offset variables. The offset variable is advanced after each send.

5. All elements of matrix B are sent to every worker process.

6. The master process waits to receive all sub-matrix results sent from the
worker processes.

7. Wall time measurement is stopped, and the interval between the start and end
times is calculated.

8. The total number of matrix operations is calculated using the formula
$N^2 (2N - 1)$ and stored in a floating point variable.

9. The number of matrix operations per second (FLOPS) is calculated by dividing
the total matrix operations by the runtime interval. For simplicity, FLOPS is
converted into MFLOPS (Mega Floating Point Operations Per Second) by dividing
it by $10^6$.

Algorithm 3 MPI Worker Process Algorithm

{Worker Process}
if taskid > 0 then
    mtype ← 1
    source_prcs ← 0
    {Retrieving values sent by master process}
    MPI_Recv(&offset, 1, MPI_INT, source_prcs, mtype, MPI_COMM_WORLD, &status)
    MPI_Recv(&rows, 1, MPI_INT, source_prcs, mtype, MPI_COMM_WORLD, &status)
    count ← rows * mtrksA_col
    MPI_Recv(&a, count, MPI_DOUBLE, source_prcs, mtype, MPI_COMM_WORLD, &status)
    count ← mtrksA_col * mtrksB_col
    MPI_Recv(&b, count, MPI_DOUBLE, source_prcs, mtype, MPI_COMM_WORLD, &status)
    {Sub matrix multiplication calculation}
    for k = 0 to mtrksB_col - 1 do
        for i = 0 to rows - 1 do
            c[i][k] ← 0.0
            for j = 0 to mtrksA_col - 1 do
                c[i][k] ← c[i][k] + a[i][j] * b[j][k]
            end for
        end for
    end for
    {Send the sub matrix back to the master process}
    MPI_Send(&offset, 1, MPI_INT, MASTER, 2, MPI_COMM_WORLD)
    MPI_Send(&rows, 1, MPI_INT, MASTER, 2, MPI_COMM_WORLD)
    MPI_Send(&c, rows * NCB, MPI_DOUBLE, MASTER, 2, MPI_COMM_WORLD)
end if
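
The throughput calculation in steps 8 and 9 of the master process can be sketched as a small C function. The name `mflops` is hypothetical, and the elapsed time is assumed to be the wall time interval in seconds, as returned by the difference of two MPI_Wtime() calls.

```c
/* Throughput in MFLOPS for an N x N matrix multiplication that took
   `seconds` of wall time. */
double mflops(int n, double seconds)
{
    double N = (double)n;
    double ops = N * N * (2.0 * N - 1.0); /* total floating-point operations */
    return ops / seconds / 1.0e6;         /* FLOPS converted to MFLOPS */
}
```

The operation count is kept in a floating point variable because $N^2 (2N - 1)$ for N = 2000 exceeds the range of a 32-bit integer.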

Worker process algorithm steps are: 

1. Each worker process receives a subset of the rows of matrix A, according to
the offset variable sent by the master process.

2. Each worker process receives all elements (rows * columns) of matrix B sent
by the master process.

3. Each worker process performs its matrix multiplication.

4. Each worker process sends its sub rows of matrix C back to the master
process.

In DSM machine architectures, communication cost falls into two types: the cost
of communication between processes located on different machines, and the cost
of communication between processes located on the same machine.

The communication cost between worker and master processes located on different
machines, over the network, for matrix distribution can be roughly calculated
as:

• Cost of distributing sub matrix A and matrix B to all worker processes:
$(p-1) \cdot (N^2 + N/(p-1)) \cdot t_c$, which is equal to
$(pN^2 - N^2 + N) \cdot t_c$ (where $t_c$ represents the time it takes to
communicate one datum between processors over the network).

• Cost of receiving the result matrix from all worker processes:
$(p-1) \cdot (N/(p-1)) \cdot t_c = N \cdot t_c$.

• Total communication cost: $(pN^2 - N^2 + 2N) \cdot t_c$.

For the communication cost between worker and master processes located on the
same machine, the distribution steps can be assumed to be the same as for
communication over the network, except that the time to communicate one datum
between processors is not $t_c$ but $t_f$, where $t_f$ represents the time it
takes to communicate one datum between processors over shared memory, which is
faster than $t_c$. Thus the cost of communication over shared memory can be
calculated as $(pN^2 - N^2 + 2N) \cdot t_f$.

The total communication cost between workers and master over the network and
over shared memory can be calculated as
$(pN^2 - N^2 + 2N) \cdot t_c + (pN^2 - N^2 + 2N) \cdot t_f$.
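
To get a feeling for the size of these terms, the cost model above can be evaluated with a small C sketch. The function names `comm_data` and `comm_cost` are hypothetical, and the per-datum time is left as a parameter because no measured value for the network or shared memory case is given here.

```c
/* Number of data items in the cost model above: p*N^2 - N^2 + 2N,
   for p processes and an N x N matrix. */
double comm_data(int p, int n)
{
    double N = (double)n;
    return (double)p * N * N - N * N + 2.0 * N;
}

/* Communication cost when one datum takes t time units to transfer
   (the network time or the shared memory time of the model). */
double comm_cost(int p, int n, double t)
{
    return comm_data(p, n) * t;
}
```

For p = 3 processes and N = 2000, `comm_data` gives 8,004,000 data items, so under this model the communication cost is dominated by the $N^2$ terms that come from sending matrix B to every worker.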

The matrix multiplication work in this parallel program is divided among the
worker processes, of which there are $p - 1$ out of the total of $p$ processes
used. Thus, the matrix multiplication complexity in MPI for each worker process
can be defined as $O(N^3/(p-1))$.

3.1.2 OpenMP Workshare Algorithm 

Because OpenMP is an implementation of multithreading that uses multiple
threads to carry instructions, OpenMP shares its work among the threads used in
a parallel region. Threads are classified into two types: master thread and
worker threads.

By default, each thread executes the parallelized section of code
independently. "Work-sharing constructs" can be used to divide a task among the
threads so that each thread executes its allocated part of the code. Both task
parallelism and data parallelism can be achieved using OpenMP in this way.

To determine how worksharing is done, OpenMP offers worksharing constructs.
Using a worksharing construct, the programmer can easily distribute work to
each thread in a parallel region in a well ordered manner. OpenMP provides
the following four constructs for this purpose.

• omp for: used to split up loop iterations among the threads, also called loop 
constructs. 

• sections: assigning consecutive but independent code blocks to different
threads.

• single: specifying a code block that is executed by only one thread; a
barrier is implied at the end.

• master: similar to single, but the code block is executed by the master
thread only and no barrier is implied at the end.
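
The constructs listed above can be sketched in C as follows. This is a minimal illustration and not part of the original program: the function name scale_by_two and the printed message are assumptions. When the code is compiled without OpenMP support, the pragmas are ignored and the loop simply runs sequentially.

```c
#include <stdio.h>

/* Fills out[i] = 2 * in[i]; the loop iterations are split among threads. */
void scale_by_two(const double *in, double *out, int n)
{
    #pragma omp parallel
    {
        /* single: exactly one thread prints; implicit barrier at the end. */
        #pragma omp single
        printf("scaling %d elements\n", n);

        /* for: the iterations are divided among the threads of the team. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            out[i] = 2.0 * in[i];
    }
}
```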

Since OpenMP is a shared memory programming model, most variables in 
OpenMP code are visible to all threads by default. But sometimes private variables 
are necessary to avoid race conditions and there is a need to pass values between 
the sequential part and the parallel region (the code block executed in parallel), so 
data environment management is introduced as data sharing attribute clauses by 
appending them to the OpenMP directive. The different types of clauses are: 

• shared: the data within a parallel region is shared, which means visible and 
accessible by all threads simultaneously. By default, all variables in the work 
sharing region are shared except the loop iteration counter. 

• private: the data within a parallel region is private to each thread, which
means each thread will have a local copy and use it as a temporary variable.
A private variable is not initialized and the value is not maintained for use
outside the parallel region. By default, the loop iteration counters in the
OpenMP loop constructs are private.

• default: allows the programmer to state that the default data scoping within a 
parallel region will be either shared, or none for C/C++, or shared, firstprivate, 
private, or none for Fortran. The none option forces the programmer to declare 
each variable in the parallel region using the data sharing attribute clauses. 

• firstprivate: like private except initialized to the original value.

• lastprivate: like private except original value is updated after construct. 

• reduction: a safe way of joining work from all threads after construct. 
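
A short C sketch of how several of these clauses combine on one directive is given below. The dot-product function is a hypothetical example chosen for illustration, not code from the original program. With default(none), every variable used in the region has to be scoped explicitly.

```c
/* Dot product with explicit data-sharing attributes. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;

    /* a, b and n are read by all threads (shared); i is a per-thread loop
     * counter (private); the per-thread partial sums are combined into
     * sum when the loop ends (reduction). */
    #pragma omp parallel for default(none) shared(a, b, n) private(i) reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Without the reduction clause, concurrent updates to sum would be a race condition, which is the situation the private and reduction clauses are meant to prevent.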

Matrix multiplication workshare between threads in OpenMP is done per matrix
row, similar to MPI. The difference is that MPI distributes its matrix
elements by sending them to all worker processes, while OpenMP only needs to
declare the scope of the matrix variables as shared or private. Take as an
example the multiplication of two NxN matrices Aij and Bij, whose result is
stored in matrix Cij. Each matrix has 4 rows (j = 4) and the number of
threads used is 2 (t = 2).

Row distribution is done using the worksharing construct "#pragma omp for",
which is placed on the outermost for loop. Thus, each thread (t) is
responsible for calculating a row i of matrix C for all column elements j.

The number of rows distributed to each thread is determined by the schedule
clause. There are three main schedule types: static, dynamic, and guided. A
static schedule divides the iterations into chunks that are assigned to the
threads before the loop executes; if no chunk size is given, the iterations
are divided approximately equally among the threads. Dynamic and guided
schedules assign chunks of iterations, whose size can be defined by the
programmer, to threads at run time as the threads become available. In this
program, iterations are distributed among threads using the static schedule.
Figure 3.5 illustrates how matrix multiplication is done in OpenMP.

Figure 3.5: OpenMP Matrix Multiplication Algorithm Scheme



The scheme in Figure 3.5 shows the matrix multiplication process at two
different execution times (t1 and t2). Because there are 4 rows in the
matrices and only 2 threads are used in the program, there are 2 residual
rows, which are assigned again to those two threads at a later time. At t1,
the 2 threads (IDs 0 and 1) calculate the first and second rows. After those
two threads have finished (assuming that the execution time of each thread
is the same), at t2 the same 2 threads calculate the third and fourth rows.
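
The alternating assignment described for Figure 3.5 is what a static schedule with a chunk size of 1 would produce, since chunks are then handed to the threads round-robin in order of thread number. The sketch below computes the owning thread under that assumption; schedule(static, 1) is an assumption made here for illustration and is not stated for the original program, and the function name is hypothetical. With a plain schedule(static) and no chunk size, implementations typically give each thread one contiguous block of rows instead.

```c
#include <assert.h>

/* Thread that executes loop iteration i under schedule(static, chunk)
 * with nthreads threads: chunks are assigned round-robin by thread id. */
int static_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;
}
```

For 4 rows and 2 threads with chunk size 1, rows 0 and 2 go to thread 0 and rows 1 and 3 go to thread 1.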

Determining how many threads should be used in a parallel region is quite
tricky. When all threads perform the same operation (e.g. matrix
multiplication), the optimal number of threads is the total number of cores
available. But if a parallel region consists of various operations (e.g.
print, open file, read file, etc.), using more threads than the number of
CPU cores might be a good idea. In that case, the number of threads can be
determined by first calculating the cost of each different operation. The
matrix multiplication program therefore uses a number of threads equal to
the number of CPU cores.
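
One way to apply the one-thread-per-core rule is sketched below using the OpenMP runtime routines omp_get_num_procs() and omp_set_num_threads(). The wrapper function name is hypothetical, and the fallback value of 1 when OpenMP is not enabled is an assumption made so that the sketch also compiles without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Requests one thread per available processor for later parallel regions
 * and returns the number requested (1 when compiled without OpenMP). */
int use_one_thread_per_core(void)
{
#ifdef _OPENMP
    int cores = omp_get_num_procs();   /* processors available to the program */
    omp_set_num_threads(cores);        /* applies to subsequent parallel regions */
    return cores;
#else
    return 1;
#endif
}
```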

The difference between MPI and OpenMP in process management is that MPI
needs its master process to do only a specific job, which is distributing
matrix elements, receiving the calculation results, and assembling the
result matrix, rather than calculating sub-matrices like its workers. The
reason behind this is to reduce parallel overhead when calculating large
matrix dimensions over network transmission. Hence, the master process can
focus only on managing and distributing data.

Unlike in MPI, process management and synchronization in OpenMP are done in
the same memory (i.e. shared memory), and not only the master thread but all
threads are responsible for thread synchronization. Hence, the master thread
can also participate in the matrix multiplication process.

The OpenMP matrix multiplication algorithm steps are:

1. The matrix values are initialized using a random function.

2. Wall time measurement is started using the omp_get_wtime() function.

3. The matrix multiplication for each thread is carried out using the
#pragma omp parallel directive and the #pragma omp for worksharing construct.

4. Wall time measurement is stopped. The interval between the start and the
end of the measurement is calculated.

5. The total number of matrix operations is calculated using the formula
$N^2 \cdot (2N - 1)$ in a floating point variable.

6. Matrix operations per second are calculated in FLOPS and then converted to
MFLOPS, the same as in MPI.

Algorithm 4 OpenMP Process Algorithm
{Initializing Matrix A}
for I = 0 to DIM - 1 do
    for J = 0 to DIM - 1 do
        MatrikA[I][J] ← RAND()
    end for
end for

{Initializing Matrix B}
for J = 0 to DIM - 1 do
    for K = 0 to DIM - 1 do
        MatrikB[J][K] ← RAND()
    end for
end for

{Start Calculating Running Time}
tot_time ← omp_get_wtime()

{Matrix Multiplication Process}
#pragma omp parallel shared(MatrikA, MatrikB, MatrikC, DIM) private(I, K, J)
#pragma omp for
for I = 0 to DIM - 1 do
    for K = 0 to DIM - 1 do
        MatrikC[I][K] ← 0
        for J = 0 to DIM - 1 do
            MatrikC[I][K] ← MatrikC[I][K] + MatrikA[I][J] * MatrikB[J][K]
        end for
    end for
end for

{Stop Calculating Running Time}
tot_time ← omp_get_wtime() - tot_time

{Calculate the matrix operation}
ops ← (2 * (pow(DIM, 3)) - (pow(DIM, 2)))

{Calculate the matrix operation per second}
rate ← ops / tot_time / 1000000.0
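
The multiplication and timing steps can be sketched as compilable C along the following lines. This is an illustrative reconstruction rather than the author's source code: the flat row-major array layout, the names matmul_omp and wall_time, and the fallback to clock() when OpenMP is not enabled are all assumptions.

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Seconds from omp_get_wtime(); without OpenMP this falls back to CPU
 * time from clock(), which is only a stand-in for wall time. */
static double wall_time(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Computes C = A * B for dim x dim row-major matrices and returns the
 * elapsed time in seconds. The rows of C are shared out among threads. */
double matmul_omp(const double *a, const double *b, double *c, int dim)
{
    double start = wall_time();

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < dim; i++) {
        for (int k = 0; k < dim; k++) {
            double acc = 0.0;              /* private to each iteration */
            for (int j = 0; j < dim; j++)
                acc += a[i * dim + j] * b[j * dim + k];
            c[i * dim + k] = acc;
        }
    }

    return wall_time() - start;
}
```

With GCC, OpenMP is enabled by compiling with the -fopenmp flag. Accumulating into a local variable instead of writing MatrikC[I][K] on every inner iteration is a small design choice that keeps the shared result matrix out of the innermost loop.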

Unlike MPI, in which communication is done using message passing,
communication in OpenMP takes place between threads and is assumed to be
fast and insignificant to the performance. Thus its cost can be ignored in
this algorithm.

The matrix multiplication work in the OpenMP parallel program is divided
among the threads, of which there are $t$ including the master thread. Thus,
the matrix multiplication complexity in OpenMP for each thread can be
defined as $O(N^3/t)$ (where the number of threads is equal to the number of
CPU cores).



CHAPTER IV 
RESULT AND DISCUSSION 



4.1 Performance Test 

The matrix multiplication algorithm was tested for dimensions ranging from
100 up to 2000 in three different scenarios: sequential algorithm, MPI
algorithm, and OpenMP algorithm. For the OpenMP and sequential programs, the
test was done on a single Intel Core Duo 1.7 GHz T5300 laptop with 1 GB RAM
running Linux SUSE. For the MPI program, the test was done on two Intel Core
Duo laptops: one with a 1.7 GHz T5300 and 1 GB RAM running Windows XP SP3,
and another with a 1.83 GHz T5500 and 1 GB RAM running Windows XP SP3.

The number of threads used in OpenMP is two (one master thread and one
worker thread). Unlike OpenMP, the number of processes used in MPI is three
(two worker processes and one master process). The reason was discussed in
the previous section. However, the number of workers (processes or threads)
that perform the matrix multiplication is equal for both programming models
(i.e. two workers).

Because the three MPI processes are distributed over two multicore / SMP
machines (each machine has two cores), one of the two machines has both of
its CPU cores occupied (by the master process and a worker process). The
computer that initiates the MPI ring daemon runs the master process on one
of its cores. Thus, the machine with the master process also runs a worker
process, and the other machine runs only one worker process.

Statistical analysis was conducted on one independent variable (matrix
dimension) against three dependent variables (program runtime, program
throughput, and speedup). This analysis shows clearly how the matrix
dimension influences each of the three dependent variables.

4.2 Statistical Analysis Result 

4.2.1 Parallel Runtime Towards Matrix Dimension 

Figure 4.1 gives the program run time obtained using the wall time function
for the three different programs (sequential, OpenMP, and MPI). Wall time is
the actual time taken by a computer to complete a task (here, matrix
multiplication). It is the sum of three terms: CPU time, I/O time, and the
communication channel delay (e.g. if data are scattered over multiple
machines, as in MPI).

In OpenMP, wall time is measured using omp_get_wtime(), starting when the
initial thread enters the parallel region and ending when it exits the
parallel region. Thus, the measured time covers thread creation,
synchronization, and the multiplication tasks.

In MPI, wall time is measured using MPI_Wtime(), starting when the master
process distributes the work among the worker processes and ending when it
has received the matrix results sent from all worker processes. The measured
time in MPI covers the communication between the master and worker processes
and the matrix multiplication in all worker processes.

For N = 100, the MPI runtime is much slower, up to 10 times, compared to
sequential and OpenMP. However, from N = 100 to 2000 the MPI runtime
gradually becomes faster relative to those two. MPI has the fastest runtime
for N > 500.



Table 4.1: Running time comparison with various matrix dimension

[Table contents not recovered; surviving label: Matrix Dimension (N)]

Figure 4.1: Running time comparison with various matrix dimension



4.2.2 Parallel Throughput Towards Matrix Dimension 

Figure 4.2 gives the throughput results for the three programs (sequential,
OpenMP, and MPI). Throughput in MFLOPS is calculated by dividing the number
of matrix operations by the wall time and by $10^6$.
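
The throughput calculation can be written as two small C helpers. The function names are hypothetical; the operation count $N^2 \cdot (2N - 1)$ is the formula given in the algorithm steps of Section 3.1.2.

```c
#include <assert.h>

/* Floating-point operations in an N x N matrix product: N^2 * (2N - 1). */
double matmul_ops(int n)
{
    double N = (double)n;
    return N * N * (2.0 * N - 1.0);
}

/* Throughput in MFLOPS: operations / wall time in seconds / 10^6. */
double mflops(int n, double seconds)
{
    assert(seconds > 0.0);   /* a zero interval would divide by zero */
    return matmul_ops(n) / seconds / 1.0e6;
}
```

For N = 100 the operation count is 10000 * 199 = 1,990,000, so a run that took one second would correspond to 1.99 MFLOPS.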

Both sequential and OpenMP show a throughput increase from N = 100 to 500,
followed by a decrease from N = 500 to 2000. MPI differs from the other two:
its throughput increases continuously from N = 100 to 2000, even though the
increase from N = 1000 to 2000 is not as significant as before.



Table 4.2: Throughput comparison with various matrix dimension

Matrix Dimension (N)   Sequential (MFLOPS)   OpenMP (MFLOPS)   MPI (MFLOPS)
100                    59.08                 ...               ...

[Remaining table values not recovered]

Figure 4.2: Throughput comparison with various matrix dimension



4.2.3 Parallel Speedup Towards Matrix Dimension 

Figure 4.3 gives the speedup of the two parallel programs (OpenMP and MPI)
relative to the sequential program. Speedup in this research is obtained by
dividing the wall-clock time of serial execution by the wall-clock time of
parallel execution.

OpenMP has a steady speedup for N = 100 to 2000, with an average value of
1.5, while MPI gives a linear speedup growth for N = 500 to 2000, ranging
from 1.38 to 4. MPI gives no speedup for N = 100 because the matrix
calculation is too small compared to the MPI running time and communication
time.
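
Since speedup is a ratio of two wall-clock times, it has no unit. A minimal C helper is sketched below; the function name is hypothetical and any timings passed to it in an example are illustrative, not measured values from this research.

```c
#include <assert.h>

/* Speedup = serial wall-clock time / parallel wall-clock time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);   /* avoid dividing by a zero interval */
    return t_serial / t_parallel;
}
```

A value above 1 means the parallel program is faster than the sequential one; a value below 1, as reported for MPI at N = 100, means it is slower.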



Table 4.3: Speed comparison with various matrix dimension

Figure 4.3: Speed comparison with various matrix dimension



CHAPTER V 
CONCLUDING REMARKS 



5.1 Conclusion 

OpenMP workshare between threads in the matrix multiplication algorithm is
done using the OpenMP FOR worksharing construct, which is placed on the
outermost loop of the matrix multiplication operation. With this placement,
each OpenMP thread is assigned to work on a row of matrix C for all columns.

MPI workshare in the matrix multiplication algorithm is done using send and
receive commands. The rows of matrix A are divided and sent together with
all elements of matrix B according to the number of worker processes, both
on different machines and on the same machine. Each worker process works on
a row of matrix A multiplied by all elements of matrix B. If there are
residual rows, they are added one each from the smallest worker process ID
to the largest worker ID.

The performance of the OpenMP and MPI programming models varies for matrix
dimension N from 1 to 2000, although many aspects were standardized for both
parallel programming models (e.g. number of matrix workers, matrix algorithm
steps, and machine specifications). The matrix multiplication work is
divided among the same number of workers (threads in OpenMP, with complexity
$O(N^3/t)$, and processes in MPI, with complexity $O(N^3/(p-1))$). The
machine specifications used for MPI are also comparable with those used for
OpenMP: Intel Core Duo 1.7 GHz (for OpenMP), and Intel Core Duo 1.7 GHz
together with Intel Core Duo 1.83 GHz, both with 1 GB of RAM (for MPI).

Performance decline is common in program performance testing, especially
when the test data becomes large. This is due to resource limitations (CPUs,
memory, etc.). However, for different programming models that use the same
resources and algorithm, there are more reasons than just resource
limitations.

For the sequential program, because there is only one worker process, it is
obvious that its overall performance is lower than the other two. For MPI
and OpenMP, differences can be caused by how fast the workshare is done
between worker processes / threads. MPI uses message passing to share its
work across processes, which incurs communication time over the network
medium. On the other hand, OpenMP uses threads to share its work inside a
shared memory machine, which incurs no network communication time. Hence,
OpenMP workshare should be faster than MPI.

This can be seen in Figures 4.1, 4.2, and 4.3. OpenMP gains fast speedup and
large throughput for N = 100 to 500, while MPI gains slower but steady
speedup and throughput. However, as N grows larger (i.e. N = 500 to 2000),
OpenMP performance gradually becomes slower while MPI can still keep up with
the growth of N.

Besides the speed of the workshare, performance differences between OpenMP
and MPI for large N can be caused by the CPU cores' access to memory. In
parallel computing, memory is scalable with the number of processors. Thus,
increasing the number of processors and the size of memory also increases
performance scalability.

MPI distributes its work by copying it to each worker process, whether the
process is located on the same memory or on a different machine's memory.
Thus, each processor located on a different machine can rapidly access its
own memory without interference and without the overhead incurred in trying
to maintain cache coherency (i.e. MPI provides strong memory locality).

For this research, MPI is tested on a distributed shared memory architecture
using two SMP (Symmetric Multiprocessor) machines. MPI places its two worker
processes on two different machines, so the copies of the data reside on
different machines. Each of the two worker processes can access its data in
its own memory, which reduces the overhead by half compared to a single
memory.

OpenMP, on the other hand, uses a single memory that is shared between the
CPU cores in one machine (UMA). Hence, the primary disadvantage is the lack
of scalability between memory and CPUs. Adding more CPUs can geometrically
increase traffic on the shared memory-CPU path, which makes it difficult to
maintain cache coherency between processors.

Therefore, the more memory used in OpenMP, the more congested the traffic on
the shared memory-CPU path becomes, which results in a bottleneck. The
increase in traffic associated with cache/memory management produces more
parallel overhead while maintaining cache coherency. The OpenMP matrix
program experienced this problem for N = 500 to 2000: its performance
decreases over this range because traffic on the shared memory-CPU path
gradually becomes more congested with 1 GB of memory. This can be seen in
Figure 4.3, where OpenMP does not give a speedup greater than 1.5.

5.2 Future Work

Message passing and threads have nowadays been combined in DSM (Distributed
Shared Memory) parallel architectures to achieve better performance. In this
research, the MPI parallel expression used on shared memory architectures
has not exploited thread safety programming explicitly. Using thread safety
expressions, MPI can explicitly control the threads running on multiple
cores across SMP machines. This message passing - thread model is also
referred to as the hybrid parallel programming model. In future research,
the hybrid parallel programming model will be evaluated on a DSM
(Distributed Shared Memory) architecture.